-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPARK-1786: Edge Partition Serialization #724
Conversation
Build triggered. |
Build started. |
Merged build triggered. |
Merged build started. |
Build finished. |
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14875/ |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14877/ |
@jegonzal should this be added in 1.0? |
I would like to get it into 1.0 if possible. Otherwise, we could run into issues if the user persists graphs to disk or straggler mitigation is used. @ankurdave do you see any issues with trying to get this into 1.0? |
Alright, sounds good. @ankurdave or @rxin can you take a quick look? |
This looks good to me. Re-enabling Kryo reference tracking will have a performance penalty, but we can easily fix that after the release. |
Alright, then I'll merge this as is. You guys should add some docs in both the GraphX programming guide and GraphXKryoSerializer to mention that it's recommended to turn off reference tracking. Just send a separate PR for that. (Doc changes can also go in after 1.0 is officially cut, we can update the website). |
btw as far as I can tell Kryo reference should always be disabled in the On Sunday, May 11, 2014, Matei Zaharia [email protected] wrote:
|
This appears to address the issue with edge partition serialization. The solution appears to be just registering the `PrimitiveKeyOpenHashMap`. However I noticed that we appear to have forked that code in GraphX but retained the same name (which is confusing). I also renamed our local copy to `GraphXPrimitiveKeyOpenHashMap`. We should consider dropping that and using the one in Spark if possible. Author: Ankur Dave <[email protected]> Author: Joseph E. Gonzalez <[email protected]> Closes #724 from jegonzal/edge_partition_serialization and squashes the following commits: b0a525a [Ankur Dave] Disable reference tracking to fix serialization test bb7f548 [Ankur Dave] Add failing test for EdgePartition Kryo serialization 67dac22 [Joseph E. Gonzalez] Making EdgePartition serializable. (cherry picked from commit a6b02fb) Signed-off-by: Matei Zaharia <[email protected]>
Alternatively found a way to work around that in the repl so it can safely On Sunday, May 11, 2014, Matei Zaharia [email protected] wrote:
|
I think we can warn if it's on or something. I wouldn't add code to disable it since we might be able to fix it to work there too. |
My only concern is that I would prefer things work slowly than fail. With reference tracking disabled it is not possible to serialize user defined types from the spark-shell. A second concern is that it will be difficult for the user to enable reference tracking if we disable it in the GraphX Kryo registrar. |
This was merged without ever passing jenkins. I've reverted this because it's causing all other PR's to break. We need to add an exclude file to the binary check. |
To fix this we can just add the Joey - mind re-opening this? |
This appears to address the issue with edge partition serialization. The solution appears to be just registering the `PrimitiveKeyOpenHashMap`. However I noticed that we appear to have forked that code in GraphX but retained the same name (which is confusing). I also renamed our local copy to `GraphXPrimitiveKeyOpenHashMap`. We should consider dropping that and using the one in Spark if possible. Author: Ankur Dave <[email protected]> Author: Joseph E. Gonzalez <[email protected]> Closes apache#724 from jegonzal/edge_partition_serialization and squashes the following commits: b0a525a [Ankur Dave] Disable reference tracking to fix serialization test bb7f548 [Ankur Dave] Add failing test for EdgePartition Kryo serialization 67dac22 [Joseph E. Gonzalez] Making EdgePartition serializable.
### What changes were proposed in this pull request? Added optimizer rule `RemoveRedundantAggregates`. It removes redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values. The affected part of the query plan for TPCDS q87: Before: ``` == Physical Plan == *(26) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#785] +- *(25) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#724] +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti :- ... ``` After: ``` == Physical Plan == *(26) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#751] +- *(25) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#694] +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti :- ... ``` ### Why are the changes needed? Performance improvements - few TPCDS queries have these kinds of duplicate aggregates. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Benchmarks (sf=5): OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Linux 5.8.13-arch1-1 Intel(R) Core(TM) i5-6500 CPU 3.20GHz | Query | Before | After | Speedup | | ------| ------- | ------| ------- | | q14a | 44s | 44s | 1x | | q14b | 41s | 41s | 1x | | q38 | 6.5s | 5.9s | 1.1x | | q87 | 7.2s | 6.8s | 1.1x | | q14a-v2.7 | 55s | 53s | 1x | Closes #30018 from tanelk/SPARK-33122. Lead-authored-by: [email protected] <[email protected]> Co-authored-by: Tanel Kiis <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
* KE-43300 use crontab expression to periodicGC in KE * KE-43300 fix review
This appears to address the issue with edge partition serialization. The solution appears to be just registering the
PrimitiveKeyOpenHashMap
. However I noticed that we appear to have forked that code in GraphX but retained the same name (which is confusing). I also renamed our local copy toGraphXPrimitiveKeyOpenHashMap
. We should consider dropping that and using the one in Spark if possible.